(Ford GoBikes System 2019-02 Dataset)

by (Abdullah Alsuwailem)

Preliminary Wrangling

This dataset is taken from https://www.fordgobike.com/system-data and covers trips taken by members of the service during February 2019.
The data consists of information about each trip: the user's type, age, and gender, the start and end stations, the trip duration, etc.

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import plotly.io as pio
import plotly.express as px
import plotly.graph_objects as go
pd.options.plotting.backend = "plotly"
pd.options.mode.chained_assignment = None
pio.templates.default = 'plotly_dark'

I will load in the dataset and describe its properties through the questions below, trying to motivate my exploration goals as I go.

In [2]:
df = pd.read_csv("201902-fordgobike-tripdata.csv")
In [3]:
df.head(8)
Out[3]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip
0 52185 2019-02-28 17:32:10.1450 2019-03-01 08:01:55.9750 21.0 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 13.0 Commercial St at Montgomery St 37.794231 -122.402923 4902 Customer 1984.0 Male No
1 42521 2019-02-28 18:53:21.7890 2019-03-01 06:42:03.0560 23.0 The Embarcadero at Steuart St 37.791464 -122.391034 81.0 Berry St at 4th St 37.775880 -122.393170 2535 Customer NaN NaN No
2 61854 2019-02-28 12:13:13.2180 2019-03-01 05:24:08.1460 86.0 Market St at Dolores St 37.769305 -122.426826 3.0 Powell St BART Station (Market St at 4th St) 37.786375 -122.404904 5905 Customer 1972.0 Male No
3 36490 2019-02-28 17:54:26.0100 2019-03-01 04:02:36.8420 375.0 Grove St at Masonic Ave 37.774836 -122.446546 70.0 Central Ave at Fell St 37.773311 -122.444293 6638 Subscriber 1989.0 Other No
4 1585 2019-02-28 23:54:18.5490 2019-03-01 00:20:44.0740 7.0 Frank H Ogawa Plaza 37.804562 -122.271738 222.0 10th Ave at E 15th St 37.792714 -122.248780 4898 Subscriber 1974.0 Male Yes
5 1793 2019-02-28 23:49:58.6320 2019-03-01 00:19:51.7600 93.0 4th St at Mission Bay Blvd S 37.770407 -122.391198 323.0 Broadway at Kearny 37.798014 -122.405950 5200 Subscriber 1959.0 Male No
6 1147 2019-02-28 23:55:35.1040 2019-03-01 00:14:42.5880 300.0 Palm St at Willow St 37.317298 -121.884995 312.0 San Jose Diridon Station 37.329732 -121.901782 3803 Subscriber 1983.0 Female No
7 1615 2019-02-28 23:41:06.7660 2019-03-01 00:08:02.7560 10.0 Washington St at Kearny St 37.795393 -122.404770 127.0 Valencia St at 21st St 37.756708 -122.421025 6329 Subscriber 1989.0 Male No
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration_sec             183412 non-null  int64  
 1   start_time               183412 non-null  object 
 2   end_time                 183412 non-null  object 
 3   start_station_id         183215 non-null  float64
 4   start_station_name       183215 non-null  object 
 5   start_station_latitude   183412 non-null  float64
 6   start_station_longitude  183412 non-null  float64
 7   end_station_id           183215 non-null  float64
 8   end_station_name         183215 non-null  object 
 9   end_station_latitude     183412 non-null  float64
 10  end_station_longitude    183412 non-null  float64
 11  bike_id                  183412 non-null  int64  
 12  user_type                183412 non-null  object 
 13  member_birth_year        175147 non-null  float64
 14  member_gender            175147 non-null  object 
 15  bike_share_for_all_trip  183412 non-null  object 
dtypes: float64(7), int64(2), object(7)
memory usage: 22.4+ MB
In [5]:
df.describe()
Out[5]:
duration_sec start_station_id start_station_latitude start_station_longitude end_station_id end_station_latitude end_station_longitude bike_id member_birth_year
count 183412.000000 183215.000000 183412.000000 183412.000000 183215.000000 183412.000000 183412.000000 183412.000000 175147.000000
mean 726.078435 138.590427 37.771223 -122.352664 136.249123 37.771427 -122.352250 4472.906375 1984.806437
std 1794.389780 111.778864 0.099581 0.117097 111.515131 0.099490 0.116673 1664.383394 10.116689
min 61.000000 3.000000 37.317298 -122.453704 3.000000 37.317298 -122.453704 11.000000 1878.000000
25% 325.000000 47.000000 37.770083 -122.412408 44.000000 37.770407 -122.411726 3777.000000 1980.000000
50% 514.000000 104.000000 37.780760 -122.398285 100.000000 37.781010 -122.398279 4958.000000 1987.000000
75% 796.000000 239.000000 37.797280 -122.286533 235.000000 37.797320 -122.288045 5502.000000 1992.000000
max 85444.000000 398.000000 37.880222 -121.874119 398.000000 37.880222 -121.874119 6645.000000 2001.000000

Data Cleaning

Since I'm using the Ford GoBike data, I need to clean it before I start visualizing it.

In [6]:
# Let's make a copy of the dataframe to differentiate between the cleaned data and the original data
fgb_clean = df.copy() # fgb_clean = Ford GoBike cleaned data
In [7]:
fgb_clean.isna().sum()
Out[7]:
duration_sec                  0
start_time                    0
end_time                      0
start_station_id            197
start_station_name          197
start_station_latitude        0
start_station_longitude       0
end_station_id              197
end_station_name            197
end_station_latitude          0
end_station_longitude         0
bike_id                       0
user_type                     0
member_birth_year          8265
member_gender              8265
bike_share_for_all_trip       0
dtype: int64
In [8]:
fgb_clean = fgb_clean.dropna()
fgb_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration_sec             174952 non-null  int64  
 1   start_time               174952 non-null  object 
 2   end_time                 174952 non-null  object 
 3   start_station_id         174952 non-null  float64
 4   start_station_name       174952 non-null  object 
 5   start_station_latitude   174952 non-null  float64
 6   start_station_longitude  174952 non-null  float64
 7   end_station_id           174952 non-null  float64
 8   end_station_name         174952 non-null  object 
 9   end_station_latitude     174952 non-null  float64
 10  end_station_longitude    174952 non-null  float64
 11  bike_id                  174952 non-null  int64  
 12  user_type                174952 non-null  object 
 13  member_birth_year        174952 non-null  float64
 14  member_gender            174952 non-null  object 
 15  bike_share_for_all_trip  174952 non-null  object 
dtypes: float64(7), int64(2), object(7)
memory usage: 22.7+ MB

I want to convert the duration_sec column to duration_hour so the durations are easier to read.

In [9]:
fgb_clean['duration_sec'] = fgb_clean['duration_sec'] / (60 * 60) # convert seconds to hours
fgb_clean.rename(columns={'duration_sec': 'duration_hour'}, inplace=True)
fgb_clean['duration_hour']
Out[9]:
0         14.495833
2         17.181667
3         10.136111
4          0.440278
5          0.498056
            ...    
183407     0.133333
183408     0.086944
183409     0.039167
183410     0.038611
183411     0.075278
Name: duration_hour, Length: 174952, dtype: float64

As I learned in the previous project, I have to convert (start_time, end_time) from object to datetime64.

In [10]:
fgb_clean["start_time"] = pd.to_datetime(fgb_clean["start_time"])
fgb_clean["end_time"] = pd.to_datetime(fgb_clean["end_time"])
fgb_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   duration_hour            174952 non-null  float64       
 1   start_time               174952 non-null  datetime64[ns]
 2   end_time                 174952 non-null  datetime64[ns]
 3   start_station_id         174952 non-null  float64       
 4   start_station_name       174952 non-null  object        
 5   start_station_latitude   174952 non-null  float64       
 6   start_station_longitude  174952 non-null  float64       
 7   end_station_id           174952 non-null  float64       
 8   end_station_name         174952 non-null  object        
 9   end_station_latitude     174952 non-null  float64       
 10  end_station_longitude    174952 non-null  float64       
 11  bike_id                  174952 non-null  int64         
 12  user_type                174952 non-null  object        
 13  member_birth_year        174952 non-null  float64       
 14  member_gender            174952 non-null  object        
 15  bike_share_for_all_trip  174952 non-null  object        
dtypes: datetime64[ns](2), float64(8), int64(1), object(5)
memory usage: 22.7+ MB
In [11]:
# Remove unneeded columns from the dataset
fgb_clean = fgb_clean.drop(["start_station_id", "start_station_latitude", "start_station_longitude", "end_station_id", 
              "end_station_latitude", "end_station_longitude"], axis=1)
fgb_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 10 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   duration_hour            174952 non-null  float64       
 1   start_time               174952 non-null  datetime64[ns]
 2   end_time                 174952 non-null  datetime64[ns]
 3   start_station_name       174952 non-null  object        
 4   end_station_name         174952 non-null  object        
 5   bike_id                  174952 non-null  int64         
 6   user_type                174952 non-null  object        
 7   member_birth_year        174952 non-null  float64       
 8   member_gender            174952 non-null  object        
 9   bike_share_for_all_trip  174952 non-null  object        
dtypes: datetime64[ns](2), float64(2), int64(1), object(5)
memory usage: 14.7+ MB
In [12]:
# Remove the 'Other' gender to keep only Male and Female
fgb_clean = fgb_clean[fgb_clean.member_gender != 'Other']
fgb_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 171305 entries, 0 to 183411
Data columns (total 10 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   duration_hour            171305 non-null  float64       
 1   start_time               171305 non-null  datetime64[ns]
 2   end_time                 171305 non-null  datetime64[ns]
 3   start_station_name       171305 non-null  object        
 4   end_station_name         171305 non-null  object        
 5   bike_id                  171305 non-null  int64         
 6   user_type                171305 non-null  object        
 7   member_birth_year        171305 non-null  float64       
 8   member_gender            171305 non-null  object        
 9   bike_share_for_all_trip  171305 non-null  object        
dtypes: datetime64[ns](2), float64(2), int64(1), object(5)
memory usage: 14.4+ MB
In [13]:
# Create a column that displays the day of the week
fgb_clean.insert(0,"Day", fgb_clean["start_time"].dt.day_name())
In [14]:
# Create TimeOfDay to show whether the start time falls during the day or at night.
fgb_clean['TimeOfDay'] = 'Day'
night_hours = (fgb_clean['start_time'].dt.hour.between(17, 23) |
               fgb_clean['start_time'].dt.hour.between(0, 5))
fgb_clean.loc[night_hours, 'TimeOfDay'] = 'Night'
fgb_clean['TimeOfDay']
Out[14]:
0         Night
2           Day
4         Night
5         Night
6         Night
          ...  
183407    Night
183408    Night
183409    Night
183410    Night
183411    Night
Name: TimeOfDay, Length: 171305, dtype: object

What is the structure of your dataset?

The dataset is composed of 10 columns, each giving information about the trips: (duration_hour, start_time, end_time, start_station_name, end_station_name, bike_id, user_type, member_birth_year, member_gender, bike_share_for_all_trip).

What are the main features of interest in your dataset?

1- Count the most common time for a ride to start.
2- Show the most common end stations that people end at.
3- Show the entire date range using start_time.
4- Show members' gender percentages.
5- Show user type percentages.
6- Calculate the maximum duration in hours and display its day.
7- Show the count of members per birth year alongside their duration_hour.
8- Show each birth year's maximum duration_hour.
9- Show the top 5 end stations by day of week, with females only.
10- Show the top 5 end stations by TimeOfDay, with customers only.
11- Scatter the bike sharing of males and females on Tuesday.

What features in the dataset do you think will help support your investigation into your features of interest?

After cleaning the data and dropping the unnecessary columns, I'll be using almost all of the remaining columns to support the investigation into my features of interest.

Univariate Exploration

What is the most common time for a rider to start?

In [15]:
fgb_clean["start_time"].dt.hour.hist(legend = True)

As seen in the graph, the most common start hour is 17 (5 PM), followed by 8 AM.

Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I needed to convert the start_time column from object to datetime64[ns]. Then I used fgb_clean["start_time"].dt.hour so the x axis shows the hour of day instead of the whole date.
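The idea behind the histogram can be shown on a tiny synthetic sample (the timestamps below are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical timestamps illustrating how the start hour is extracted
times = pd.Series(pd.to_datetime([
    "2019-02-01 08:15:00",
    "2019-02-01 17:45:00",
    "2019-02-02 17:05:00",
]))
hours = times.dt.hour            # 8, 17, 17 -- the values the histogram bins
most_common_hour = hours.mode()[0]
print(most_common_hour)          # -> 17
```

On the full dataset the same `.dt.hour` extraction feeds the histogram, binning trips by hour of day.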

Show the graph of the most common end stations that people end at

Now, I will graph the top 5 end stations in the dataset

In [33]:
counter = fgb_clean['end_station_name'].value_counts()
top_stations = fgb_clean.loc[fgb_clean['end_station_name'].isin(counter.index[0:5])]
fig = top_stations['end_station_name'].value_counts().plot(kind='barh');
fig.update_layout(
    title="Top 5 end stations",
    yaxis_title="(End station names)",
    xaxis_title="(Value)",
    legend_title="variable",
    font=dict(
        family="Courier New, monospace",
        color="lightblue"
    )
)
fig.update_traces(marker_color='grey')
fig.show()

Show a visualization of the entire date range using start_time

Let's see the start times over the whole date range in the dataset

In [17]:
fig = fgb_clean["start_time"].dt.date.plot(kind = 'hist')
fig.update_layout(
    title="Start time of the entire date",
    xaxis_title="Date",
    legend_title="variable",
    font=dict(
        family="Courier New, monospace",
    )
)
fig.update_traces(marker_color='brown')
fig.show()

Another visualization

In [18]:
fgb_clean['start_time'].dt.month_name().hist()

Why did the previous two visualizations show only the month of February?

Because, as mentioned in the introduction, this dataset covers February only, unlike some other datasets that span the entire year.
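This can also be confirmed programmatically. A minimal sketch, using hypothetical timestamps in place of fgb_clean["start_time"], checks that every trip starts in the same month:

```python
import pandas as pd

# Hypothetical start times standing in for fgb_clean["start_time"]
starts = pd.Series(pd.to_datetime([
    "2019-02-01 00:04:52",
    "2019-02-15 12:30:00",
    "2019-02-28 23:55:35",
]))
months = starts.dt.month_name().unique()
print(months)  # -> ['February']
```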

Show member's gender percentages

In [19]:
labels = list(fgb_clean['member_gender'].value_counts().index)
values = list(fgb_clean['member_gender'].value_counts().values)
fig = go.Figure(data=[go.Pie(labels=labels, values=values, pull=[0, 0.2])]) # pull means pull Female from pie chart by 0.2
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(title = 'Male Vs Female')
fig.show()

Males dominate at 76.2%, while females account for only 23.8%.
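These percentages can also be read directly from value_counts with normalize=True, without building a chart. A minimal sketch on toy data (3 males, 1 female) standing in for fgb_clean["member_gender"]:

```python
import pandas as pd

# Toy gender column standing in for fgb_clean["member_gender"]
gender = pd.Series(["Male", "Male", "Male", "Female"])
pct = gender.value_counts(normalize=True).mul(100).round(1)
print(pct["Male"], pct["Female"])  # -> 75.0 25.0
```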

Show user_type's percentages

In [20]:
labels = list(fgb_clean['user_type'].value_counts().index)
values = list(fgb_clean['user_type'].value_counts().values)
fig = go.Figure(data=[go.Pie(labels=labels, values=values, pull=[0, 0])]) # without pull
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(title = 'Subscribers Vs Customers')
fig.show()

Bivariate Exploration

What is the maximum duration in hours, and on which day?

In [21]:
fig = fgb_clean.plot(kind='scatter', x = 'Day', y = 'duration_hour');
fig.update_layout(
    title="Day / Duration hours",
    xaxis_title="Day",
    yaxis_title="duration_hour",
    font=dict(
        family="Courier New, monospace",
        color="lightblue"
    )
)
fig.update_traces(marker_color='lightyellow')
fig.show()

The graph shows that Saturday has the highest point; clicking that point shows that its duration_hour = 23.485556.

Now let's verify that.

In [22]:
max_row = fgb_clean['duration_hour'].idxmax() # get the index of maximum duration hour
fgb_clean.loc[max_row] # locate the row with the maximum duration_hour
Out[22]:
Day                                                            Saturday
duration_hour                                                 23.485556
start_time                                   2019-02-16 15:48:25.029000
end_time                                     2019-02-17 15:17:33.080000
start_station_name         Powell St BART Station (Market St at 4th St)
end_station_name                                   Myrtle St at Polk St
bike_id                                                            6301
user_type                                                    Subscriber
member_birth_year                                                1981.0
member_gender                                                      Male
bike_share_for_all_trip                                              No
TimeOfDay                                                           Day
Name: 85465, dtype: object

Indeed, the maximum duration is 23.485556 hours, and the day is Saturday.

Talk about some of the relationships you observed in this part of the investigation. How did the features of interest vary with other features in the dataset?

For this part, I inserted a new column called "Day" containing the day of the week (Sunday, Monday, ..., Saturday). I then used a scatter plot (via the plotly library) to display duration_hour for every day of the week, so the maximum duration is clearly visible as the highest point in the graph.

Show a graph that displays the count of members per birth year with respect to duration_hour

In [23]:
# Note: This is Bivariate because I used two variables 'member_birth_year' and 'duration_hour' with an aggregate function


fig = fgb_clean.groupby('member_birth_year')['duration_hour'].agg(['count']).plot(kind = 'line', y = 'count')
fig.update_layout(
    title="The count of member birth date",
    font=dict(
        family="Courier New, monospace",
    )
)
fig.show()

The graph shows that the peak lies between 1980 and 1995. There is then a sharp decrease approaching 2000, because very few members have a birth year near 2000.
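Note that describe() earlier showed a minimum member_birth_year of 1878, which is clearly an outlier. A minimal sketch of how such records could be filtered; the age cutoff of 80 is an arbitrary assumption, and the toy birth years below stand in for fgb_clean["member_birth_year"]:

```python
import pandas as pd

# Toy birth years, including the implausible 1878 seen in describe()
years = pd.Series([1878.0, 1984.0, 1990.0, 2001.0])
ages = 2019 - years              # dataset is from February 2019
plausible = years[ages <= 80]    # 80 is a hypothetical cutoff, not from the source
print(len(plausible))            # -> 3 (the 141-year-old record is dropped)
```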

Display a visualization that shows each birth year's maximum duration_hour

In [24]:
fig = fgb_clean.groupby('member_birth_year')['duration_hour'].agg(['max']).plot(kind = 'scatter', y = 'max')
fig.update_layout(
    title="The maximum duration hour by member birth date",
    font=dict(
        family="Courier New, monospace",
    )
)
fig.show()

Multivariate Exploration

In [25]:
female = top_stations.loc[top_stations['member_gender'] == 'Female']
female.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3977 entries, 81 to 183335
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   Day                      3977 non-null   object        
 1   duration_hour            3977 non-null   float64       
 2   start_time               3977 non-null   datetime64[ns]
 3   end_time                 3977 non-null   datetime64[ns]
 4   start_station_name       3977 non-null   object        
 5   end_station_name         3977 non-null   object        
 6   bike_id                  3977 non-null   int64         
 7   user_type                3977 non-null   object        
 8   member_birth_year        3977 non-null   float64       
 9   member_gender            3977 non-null   object        
 10  bike_share_for_all_trip  3977 non-null   object        
 11  TimeOfDay                3977 non-null   object        
dtypes: datetime64[ns](2), float64(2), int64(1), object(7)
memory usage: 403.9+ KB

Show the top 5 end stations by day of week, with females only

In [26]:
pio.templates.default = 'plotly'
fig = female.plot(kind='barh', y = 'end_station_name', color = 'Day'); # transparent visualization
fig.update_layout(
    title="Top 5 end stations with Females",
    yaxis_title="(End station names)",
    xaxis_title="(Value)",
    legend_title="variable",
    font=dict(
        family="Courier New, monospace",
        color="blue",
    )
)
fig.show()

I noticed that the females' top end station is 'San Francisco Caltrain Station 2 (Townsend St at 4th St)', with Tuesday being the most common day. The Saturday and Sunday bars are very small because those days fall on the weekend.
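The per-day counts behind a grouped bar chart like this can also be tabulated with pd.crosstab. A minimal sketch on toy data standing in for the `female` subset (the station names below are shortened placeholders):

```python
import pandas as pd

# Toy end-station/day pairs standing in for the `female` subset
trips = pd.DataFrame({
    "end_station_name": ["Caltrain", "Caltrain", "Ferry", "Caltrain"],
    "Day": ["Tuesday", "Tuesday", "Saturday", "Sunday"],
})
counts = pd.crosstab(trips["end_station_name"], trips["Day"])
print(counts.loc["Caltrain", "Tuesday"])  # -> 2
```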

Show the top 5 end stations by TimeOfDay, with customers only

In [27]:
cust = top_stations.loc[top_stations['user_type'] == 'Customer']
cust.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1643 entries, 26 to 183219
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   Day                      1643 non-null   object        
 1   duration_hour            1643 non-null   float64       
 2   start_time               1643 non-null   datetime64[ns]
 3   end_time                 1643 non-null   datetime64[ns]
 4   start_station_name       1643 non-null   object        
 5   end_station_name         1643 non-null   object        
 6   bike_id                  1643 non-null   int64         
 7   user_type                1643 non-null   object        
 8   member_birth_year        1643 non-null   float64       
 9   member_gender            1643 non-null   object        
 10  bike_share_for_all_trip  1643 non-null   object        
 11  TimeOfDay                1643 non-null   object        
dtypes: datetime64[ns](2), float64(2), int64(1), object(7)
memory usage: 166.9+ KB
In [28]:
pio.templates.default = 'plotly_dark'
fig = cust.plot(kind='barh', y = 'end_station_name', color = 'TimeOfDay'); # transparent visualization
fig.update_layout(
    title="Top 5 end stations by TimeOfDay with Customers",
    yaxis_title="(End station names)",
    xaxis_title="(Value)",
    legend_title="variable",
    font=dict(
        family="Courier New, monospace",
        color="lightblue"
    )
)
fig.show()

Talk about some of the relationships you observed in this part of the investigation.

The customers' top end station is 'San Francisco Ferry Building (Harry Bridges Plaza)', with Day being more common than Night.

Were there any interesting or surprising interactions between features?

Yes, I was really surprised to see that Day is more common than Night among customers; I expected the opposite.

In [29]:
tuesday = fgb_clean.loc[fgb_clean['Day'] == 'Tuesday']
tuesday.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30022 entries, 14381 to 165721
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   Day                      30022 non-null  object        
 1   duration_hour            30022 non-null  float64       
 2   start_time               30022 non-null  datetime64[ns]
 3   end_time                 30022 non-null  datetime64[ns]
 4   start_station_name       30022 non-null  object        
 5   end_station_name         30022 non-null  object        
 6   bike_id                  30022 non-null  int64         
 7   user_type                30022 non-null  object        
 8   member_birth_year        30022 non-null  float64       
 9   member_gender            30022 non-null  object        
 10  bike_share_for_all_trip  30022 non-null  object        
 11  TimeOfDay                30022 non-null  object        
dtypes: datetime64[ns](2), float64(2), int64(1), object(7)
memory usage: 3.0+ MB

Scatter the bike sharing of males and females on Tuesday

In [30]:
fig2 = tuesday.plot(kind='scatter', x = 'bike_share_for_all_trip', y = 'duration_hour', color = 'member_gender');
fig2.update_layout(
    title="Bike sharing of Males and Females during Tuesday", # depending on duration_hour
    yaxis_title="(Duration hour)",
    xaxis_title="(Bike sharing)",
    legend_title="variable",
    plot_bgcolor='rgb(10,10,10)',
    font=dict(
        family="Courier New, monospace",
        color="lightblue"
    )
)
fig2.show()

Trips with bike_share_for_all_trip enabled are very rare among both males and females on Tuesday.
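The counts underlying this scatter can be checked with a groupby. A minimal sketch on toy data standing in for the `tuesday` subset:

```python
import pandas as pd

# Toy Tuesday trips standing in for the `tuesday` subset
trips = pd.DataFrame({
    "member_gender": ["Male", "Male", "Female", "Female"],
    "bike_share_for_all_trip": ["No", "Yes", "No", "No"],
})
share_counts = trips.groupby("member_gender")["bike_share_for_all_trip"].value_counts()
print(share_counts[("Male", "Yes")])  # -> 1
```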

Storing Data

Save the gathered, assessed, and cleaned dataset to a CSV file named "fgb_cleaned.csv".

In [31]:
fgb_clean.to_csv('fgb_cleaned.csv', index = False)

Finally, I finished my project. 👍